Yi-Ju Tseng (曾意儒), Information Management, Chang Gung University
The caret package. Data mining builds algorithms or models that learn new information from input data, turning it into intelligence that can be used to predict events or support decision-making.
When the data are too scarce or too dirty, the effectiveness of data mining suffers. For data mining to be useful, the following conditions must hold:
There are two ways to learn these patterns/models:
Supervised learning
Unsupervised learning
Common data mining algorithms in supervised learning include:
Common data mining algorithms in unsupervised learning include:
Regression analysis examines whether variables are related, and the direction and strength of that relationship, building a mathematical model so that observed variables can be used to predict the variable the researcher cares about. Common regression algorithms include:
Linear regression of tooth length on vitamin C dose:
# install.packages("mlbench")
library(mlbench)
# The Effect of Vitamin C on Tooth Growth in Guinea Pigs
data("ToothGrowth")
knitr::kable(head(ToothGrowth))
| len | supp | dose |
|---|---|---|
| 4.2 | VC | 0.5 |
| 11.5 | VC | 0.5 |
| 7.3 | VC | 0.5 |
| 5.8 | VC | 0.5 |
| 6.4 | VC | 0.5 |
| 10.0 | VC | 0.5 |
Use lm() for linear regression: lm(formula, data = dataset_name), together with a formula. Example: the relationship between tooth length and vitamin C dose:
lm(len ~ dose, data = ToothGrowth)
Call:
lm(formula = len ~ dose, data = ToothGrowth)
Coefficients:
(Intercept) dose
7.423 9.764
len = 9.764 * dose + 7.423
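The fitted coefficients can also be used for prediction through predict(); a minimal sketch, refitting the same model:

```r
# Refit the simple linear model and predict tooth length
# at new vitamin C doses (0.5, 1, and 2 mg/day)
fit <- lm(len ~ dose, data = ToothGrowth)
predict(fit, newdata = data.frame(dose = c(0.5, 1, 2)))
# each prediction equals 9.764 * dose + 7.423
```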
glm() is used much like lm(), with a family argument selecting the model:
family = "gaussian": linear (Gaussian) model
family = "binomial": logistic regression model
family = "poisson": Poisson regression model
Gaussian distribution: the Gaussian function is the probability density of the normal distribution.
Binomial distribution: the discrete probability distribution of the number of successes in n independent yes/no trials.
Poisson distribution: a distribution of event counts:
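Written out explicitly (standard textbook forms; the original slides showed these as images):

```latex
% Binomial: probability of k successes in n independent Bernoulli trials
P(X = k) = \binom{n}{k}\, p^{k} (1-p)^{n-k}, \qquad k = 0, 1, \ldots, n
% Poisson: probability of k events given rate \lambda
P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}, \qquad k = 0, 1, 2, \ldots
```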
Example: analyzing guinea pig tooth length as a function of supplement type and vitamin C dose:
glm(len ~ supp + dose,
    data = ToothGrowth)
Call: glm(formula = len ~ supp + dose, data = ToothGrowth)
Coefficients:
(Intercept) suppVC dose
9.272 -3.700 9.764
Degrees of Freedom: 59 Total (i.e. Null); 57 Residual
Null Deviance: 3452
Residual Deviance: 1023 AIC: 348.4
len = -3.700 * suppVC + 9.764 * dose + 9.272
The supplement-type variable was converted into a dummy variable.
install.packages("mlbench")
library(mlbench)
data(BostonHousing)
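To see exactly how R turns a factor into dummy variables, model.matrix() prints the design matrix that lm()/glm() build internally (a quick illustration using ToothGrowth's supp factor):

```r
data("ToothGrowth")
# supp is a factor with levels "OJ" and "VC"; R drops the first
# level and creates suppVC, which is 1 whenever supp == "VC"
head(model.matrix(~ supp + dose, data = ToothGrowth))
```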
To summarize, multivariable linear regression has the following characteristics:
Dummy variables: remember to convert categorical variables to factor; R then builds the dummy variables automatically. Linear regression applies when the dependent variable is continuous and the independent variables are continuous or dummy variables. Logistic regression, by contrast, is commonly used when the dependent variable is binary (0 or 1), for example:
family = "binomial" gives a logistic regression model.
library(mlbench)
data("PimaIndiansDiabetes2")
# remove rows with missing values
PimaIndiansDiabetes2<-
PimaIndiansDiabetes2[complete.cases(PimaIndiansDiabetes2),]
head(PimaIndiansDiabetes2)
| | pregnant | glucose | pressure | triceps | insulin | mass | pedigree | age | diabetes |
|---|---|---|---|---|---|---|---|---|---|
| 4 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | neg |
| 5 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | pos |
| 7 | 3 | 78 | 50 | 32 | 88 | 31.0 | 0.248 | 26 | pos |
| 9 | 2 | 197 | 70 | 45 | 543 | 30.5 | 0.158 | 53 | pos |
| 14 | 1 | 189 | 60 | 23 | 846 | 30.1 | 0.398 | 59 | pos |
| 15 | 5 | 166 | 72 | 19 | 175 | 25.8 | 0.587 | 51 | pos |
Logistic regression of diabetes status on number of pregnancies, glucose, triceps skinfold, and BMI (mass):
mylogit <-
glm(diabetes ~ pregnant+glucose+triceps+mass,
data = PimaIndiansDiabetes2,
family = "binomial")
sum<-summary(mylogit)
sum$coefficients
| | Estimate | Std. Error | z value | Pr(>|z|) |
|---|---|---|---|---|
| (Intercept) | -8.8083182 | 0.9733009 | -9.049944 | 0.0000000 |
| pregnant | 0.1433008 | 0.0406067 | 3.528997 | 0.0004171 |
| glucose | 0.0382373 | 0.0048203 | 7.932506 | 0.0000000 |
| triceps | 0.0189975 | 0.0165837 | 1.145551 | 0.2519809 |
| mass | 0.0635506 | 0.0253800 | 2.503965 | 0.0122810 |
install.packages("mlbench")
library(mlbench)
data(PimaIndiansDiabetes2)
PimaIndiansDiabetes2<-
PimaIndiansDiabetes2[complete.cases(PimaIndiansDiabetes2),]
Which model should we use for prediction, i.e., which fits best? For regression models, common selection criteria include AIC and BIC.
For both AIC and BIC, smaller is better. Below we build three models and compare their AIC values.
OneVar<-glm(diabetes ~ mass,
data = PimaIndiansDiabetes2,
family = "binomial")
TwoVar<-glm(diabetes ~ triceps+mass,
data = PimaIndiansDiabetes2,
family = "binomial")
ThreeVar<-glm(diabetes ~ glucose+triceps+mass,
data = PimaIndiansDiabetes2,
family = "binomial")
c(OneVar$aic,TwoVar$aic,ThreeVar$aic)
[1] 473.0310 470.5486 377.4528
sum2<-summary(TwoVar)
sum2$coefficients
| | Estimate | Std. Error | z value | Pr(>|z|) |
|---|---|---|---|---|
| (Intercept) | -3.5327637 | 0.5917333 | -5.970196 | 0.0000000 |
| triceps | 0.0298567 | 0.0141633 | 2.108036 | 0.0350278 |
| mass | 0.0574192 | 0.0215457 | 2.664997 | 0.0076989 |
sum3<-summary(ThreeVar)
sum3$coefficients
| | Estimate | Std. Error | z value | Pr(>|z|) |
|---|---|---|---|---|
| (Intercept) | -8.2515186 | 0.9307316 | -8.865626 | 0.0000000 |
| glucose | 0.0405379 | 0.0048409 | 8.374035 | 0.0000000 |
| triceps | 0.0266201 | 0.0162911 | 1.634024 | 0.1022537 |
| mass | 0.0469444 | 0.0245083 | 1.915454 | 0.0554347 |
A decision tree builds its model through a series of splits arranged in a tree; each split point is a node. Decision trees can be fit with the rpart package:
install.packages("rpart")
rpart() is called as rpart(formula, data):
library(rpart)
DT<-rpart(diabetes ~ pregnant+glucose+triceps+mass,
data=PimaIndiansDiabetes2)
DT
n= 392
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 392 130 neg (0.6683673 0.3316327)
2) glucose< 127.5 241 36 neg (0.8506224 0.1493776) *
3) glucose>=127.5 151 57 pos (0.3774834 0.6225166)
6) glucose< 165.5 105 52 pos (0.4952381 0.5047619)
12) triceps< 22.5 20 3 neg (0.8500000 0.1500000) *
13) triceps>=22.5 85 35 pos (0.4117647 0.5882353)
26) pregnant< 7.5 64 31 pos (0.4843750 0.5156250)
52) glucose>=134.5 49 20 neg (0.5918367 0.4081633)
104) mass< 30.25 11 2 neg (0.8181818 0.1818182) *
105) mass>=30.25 38 18 neg (0.5263158 0.4736842)
210) mass>=32.4 30 11 neg (0.6333333 0.3666667)
420) glucose>=141.5 22 6 neg (0.7272727 0.2727273) *
421) glucose< 141.5 8 3 pos (0.3750000 0.6250000) *
211) mass< 32.4 8 1 pos (0.1250000 0.8750000) *
53) glucose< 134.5 15 2 pos (0.1333333 0.8666667) *
27) pregnant>=7.5 21 4 pos (0.1904762 0.8095238) *
7) glucose>=165.5 46 5 pos (0.1086957 0.8913043) *
par(mfrow=c(1,1), mar = rep(1,4)) # bottom, left, top, right
plot(DT)
text(DT, use.n=F, all=F, cex=1)
For a more readable tree, use prp() from the rpart.plot package instead:
# install before first use
install.packages("rpart.plot")
library(rpart.plot)
prp(DT)
install.packages("mlbench")
library(mlbench)
data(PimaIndiansDiabetes2)
PimaIndiansDiabetes2<-
PimaIndiansDiabetes2[complete.cases(PimaIndiansDiabetes2),]
For this reason, the dataset is split into two parts:
Training set: used to learn; the model is fit on it, and the best-looking model is selected from it.
Test set: used to verify how well the model really learned.
While training, look only at the training set and use it to choose the best model.
Never peek at the test set during training; only then does the test set provide a true validation.
Put the positive answer (diseased / admitted, …) in factor level 2:
library(mlbench)
data("PimaIndiansDiabetes2")
PD2<-
PimaIndiansDiabetes2[complete.cases(PimaIndiansDiabetes2),]
head(PD2$diabetes)
[1] neg pos pos pos pos pos
Levels: neg pos
To separate the training and test groups, use random sampling:
sample(1:10,3) # randomly draw three numbers from 1 to 10
[1] 8 3 9
# randomly sample 1/3 of the rows, from the first row to the last
sample(1:nrow(PD2),nrow(PD2)/3)
[1] 25 331 274 309 271 181 175 112 31 276 183 106 308 174 178 259 301
[18] 88 69 18 186 294 92 151 233 206 166 357 329 187 381 371 104 286
[35] 243 114 313 172 253 303 89 168 131 129 297 340 345 6 323 169 290
[52] 288 12 84 164 261 319 139 122 32 292 93 223 354 375 287 159 230
[69] 155 160 250 228 196 73 235 224 119 41 289 202 302 60 307 365 40
[86] 176 214 193 244 167 182 108 254 190 234 138 260 299 130 216 154 140
[103] 240 15 47 21 251 13 222 269 192 28 383 52 332 355 350 107 266
[120] 209 118 71 384 100 312 87 170 311 23 346
Use random sampling to pick 1/3 of the row positions, splitting the patients into training and test sets.
PD2$Test<-F # add a column to record the group assignment
# randomly assign 1/3 to the test set
PD2[sample(1:nrow(PD2),
nrow(PD2)/3),]$Test<-T
# patient counts: training set vs. test set
c(sum(PD2$Test==F),
sum(PD2$Test==T))
[1] 262 130
Then train a multivariable logistic regression model using the training data (PD2$Test==F):
fit<-glm(diabetes~.,
data =PD2[PD2$Test==F,],
family = "binomial")
summary(fit)$coefficients
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.991605928 1.393592413 -6.4521060 1.103064e-10
pregnant 0.082080982 0.070552922 1.1633959 2.446689e-01
glucose 0.043457330 0.007165607 6.0647098 1.321921e-09
pressure -0.008647776 0.014240299 -0.6072748 5.436685e-01
triceps 0.014579728 0.019977269 0.7298159 4.655027e-01
insulin -0.002866650 0.001704889 -1.6814286 9.267969e-02
mass 0.049857334 0.030882145 1.6144388 1.064323e-01
pedigree 0.790905290 0.486245205 1.6265565 1.038313e-01
age 0.031410440 0.022139975 1.4187207 1.559805e-01
Stepwise model selection, backward elimination: start with all variables in the model, then drop them one at a time.
library(MASS)
## stepwise selection by AIC; backward elimination (direction = "backward") is the default
## trace=FALSE: suppress step-by-step output
finalModel_B<-
stepAIC(fit,
direction = "backward",
trace=FALSE)
summary(finalModel_B)$coefficients
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.452544171 1.303332306 -7.252597 4.088548e-13
glucose 0.043271412 0.007124181 6.073879 1.248568e-09
insulin -0.002837212 0.001697969 -1.670945 9.473258e-02
mass 0.053081435 0.024029795 2.208984 2.717575e-02
pedigree 0.753094062 0.476858475 1.579282 1.142714e-01
age 0.047288876 0.016888888 2.799999 5.110271e-03
Stepwise model selection, both directions: variables are both added and dropped.
## stepwise selection by AIC, both directions (direction = "both")
finalModel_Both<-
stepAIC(fit,
direction = "both",
trace=FALSE)
summary(finalModel_Both)$coefficients
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.452544171 1.303332306 -7.252597 4.088548e-13
glucose 0.043271412 0.007124181 6.073879 1.248568e-09
insulin -0.002837212 0.001697969 -1.670945 9.473258e-02
mass 0.053081435 0.024029795 2.208984 2.717575e-02
pedigree 0.753094062 0.476858475 1.579282 1.142714e-01
age 0.047288876 0.016888888 2.799999 5.110271e-03
predictPoint<-
predict(finalModel_Both,
PD2[PD2$Test==T,],
type = "response")
DMAns<-factor(ifelse(predictPoint>0.5,"pos","neg"),
levels=c("pos","neg"))
head(DMAns)
4 5 14 17 25 28
pos pos neg pos neg pos
Levels: pos neg
table(x=DMAns,
y=PD2[PD2$Test==T,"diabetes"])
y
x neg pos
pos 5 33
neg 74 18
plot(x=PD2[PD2$Test==T,"diabetes"],
y=predictPoint)
When the outcome is binary, and the model output is also binary, the performance metrics are:
Sensitivity: among truly diseased people, the proportion predicted diseased.
Specificity: among truly healthy people, the proportion predicted healthy.
Positive predictive value: among people predicted diseased, the proportion who are truly diseased.
Negative predictive value: among people predicted healthy, the proportion who are truly healthy.
Recall the validation results from above:
table(x=DMAns,
y=PD2[PD2$Test==T,"diabetes"])
y
x neg pos
pos 5 33
neg 74 18
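The four definitions map directly onto that 2x2 table; computing them by hand from the counts (TP = 33, FN = 18, FP = 5, TN = 74):

```r
# Counts read off the confusion table above
TP <- 33; FN <- 18   # truly pos patients: predicted pos / predicted neg
FP <- 5;  TN <- 74   # truly neg patients: predicted pos / predicted neg

sens <- TP / (TP + FN)  # sensitivity        33/51 ~ 0.647
spec <- TN / (TN + FP)  # specificity        74/79 ~ 0.937
ppv  <- TP / (TP + FP)  # pos. pred. value   33/38 ~ 0.868
npv  <- TN / (TN + FN)  # neg. pred. value   74/92 ~ 0.804
c(sens, spec, ppv, npv)
```

The same four values come out of the caret functions used below.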
Compute the prediction performance metrics:
predictPoint<-
predict(finalModel_Both,
PD2[PD2$Test==T,],
type = "response")
DMAns<-factor(ifelse(predictPoint>0.5,"pos","neg"),
levels=c("pos","neg"))
Compute the performance metrics with the caret package:
# install.packages("caret") # package with functions for these metrics
library(caret)
sensitivity(DMAns,
PD2[PD2$Test==T,"diabetes"],
positive="pos")
[1] 0.6470588
specificity(DMAns,
PD2[PD2$Test==T,"diabetes"],
negative="neg")
[1] 0.9367089
Compute the predictive values:
posPredValue(DMAns,
PD2[PD2$Test==T,"diabetes"],
positive="pos")
[1] 0.8684211
negPredValue(DMAns,
PD2[PD2$Test==T,"diabetes"],
negative="neg")
[1] 0.8043478
When the outcome is binary but the model output is continuous (e.g., a probability):

Compute the AUC with the pROC package:
# install.packages("pROC")
library(pROC)
# Draw ROC curve.
result.roc <-
roc(PD2[PD2$Test==T,"diabetes"], predictPoint)
# youden or closest.topleft
coords(result.roc, "best",
best.method="closest.topleft",
ret=c("threshold", "accuracy"))
threshold accuracy
0.2977053 0.8153846
Plot the ROC curve with pROC:
plot(result.roc, print.thres="best",
print.thres.best.method="closest.topleft")
Compute the AUC with the ROCR package:
library(ROCR)
pred=prediction(predictPoint,
PD2[PD2$Test==T,"diabetes"])
#Calculate the AUC value
perf_AUC=performance(pred,"auc")
AUC=perf_AUC@y.values[[1]]
Plot the ROC curve with ROCR:
perf_ROC=performance(pred,"tpr","fpr")
plot(perf_ROC, main="ROC plot")
text(0.5,0.5,paste("AUC = ",
format(AUC, digits=5,
scientific=FALSE)))
With ggplot and pROC: https://gist.github.com/jwaage/6d8f4eb096e4f18a0894ca1ce27af834
library(mlbench)
data(Sonar)
http://topepo.github.io/caret/index.html
createDataPartition splits the data while preserving class proportions:
library(caret)
set.seed(3456)
trainIndex <-
createDataPartition(iris$Species,
p = .8,
list = FALSE,
times = 1)
head(trainIndex)
Resample1
[1,] 1
[2,] 2
[3,] 4
[4,] 5
[5,] 6
[6,] 8
Use the output of createDataPartition to subset the data:
trainData<-iris[trainIndex,]
testData<-iris[-trainIndex,]
nrow(trainData)
[1] 120
nrow(testData)
[1] 30
Use trainControl() to set the model-training method and parameters:
fitControl <-
trainControl(## 10-fold CV
method = "repeatedcv",
number = 10,
## repeated ten times
repeats = 10)

method = "rpart": decision tree
set.seed(825)
DTFit1 <-
train(Species ~ ., data = trainData,
method = "rpart",
trControl = fitControl)
DTFit1
CART
120 samples
4 predictor
3 classes: 'setosa', 'versicolor', 'virginica'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times)
Summary of sample sizes: 108, 108, 108, 108, 108, 108, ...
Resampling results across tuning parameters:
cp Accuracy Kappa
0.000 0.9166667 0.875
0.425 0.7833333 0.675
0.500 0.3333333 0.000
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.
predict(DTFit1,
newdata = testData,
type = "prob")
setosa versicolor virginica
3 1 0.00000000 0.0000000
7 1 0.00000000 0.0000000
12 1 0.00000000 0.0000000
14 1 0.00000000 0.0000000
21 1 0.00000000 0.0000000
23 1 0.00000000 0.0000000
41 1 0.00000000 0.0000000
44 1 0.00000000 0.0000000
48 1 0.00000000 0.0000000
49 1 0.00000000 0.0000000
54 0 0.88636364 0.1136364
55 0 0.88636364 0.1136364
61 0 0.88636364 0.1136364
64 0 0.88636364 0.1136364
70 0 0.88636364 0.1136364
82 0 0.88636364 0.1136364
91 0 0.88636364 0.1136364
95 0 0.88636364 0.1136364
96 0 0.88636364 0.1136364
100 0 0.88636364 0.1136364
101 0 0.02777778 0.9722222
102 0 0.02777778 0.9722222
109 0 0.02777778 0.9722222
110 0 0.02777778 0.9722222
111 0 0.02777778 0.9722222
125 0 0.02777778 0.9722222
129 0 0.02777778 0.9722222
138 0 0.02777778 0.9722222
147 0 0.02777778 0.9722222
149 0 0.02777778 0.9722222
createDataPartition
library(caret)
set.seed(3456)
trainIndexBinary <-
createDataPartition(PD2$diabetes,
p = .75,
list = FALSE,
times = 1)
head(trainIndexBinary)
Resample1
[1,] 1
[2,] 2
[3,] 3
[4,] 4
[5,] 5
[6,] 6
createDataPartition
trainDataBinary<-PD2[trainIndexBinary,]
testDataBinary<-PD2[-trainIndexBinary,]
fitControlBinary <-
trainControl(method = "repeatedcv",
number = 10,
repeats = 10,
## Estimate class probabilities
classProbs = TRUE,
## Evaluate performance using
## the following function
summaryFunction = twoClassSummary)
set.seed(825)
DTFitBinary <-
train(diabetes ~ .,
data = trainDataBinary,
method = "rpart",
trControl = fitControlBinary,
## Specify which metric to optimize
metric = "ROC")
DTFitBinary
CART
295 samples
9 predictor
2 classes: 'neg', 'pos'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times)
Summary of sample sizes: 266, 266, 265, 266, 265, 265, ...
Resampling results across tuning parameters:
cp ROC Sens Spec
0.02265372 0.7876810 0.8688684 0.6184545
0.11650485 0.7348005 0.8486842 0.6284545
0.37864078 0.5977380 0.8956579 0.2998182
ROC was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.02265372.
pred<-
predict(DTFitBinary,
newdata = testDataBinary,
type = "prob")
knitr::kable(head(pred))
| | neg | pos |
|---|---|---|
| 21 | 0.8516484 | 0.1483516 |
| 41 | 0.2371134 | 0.7628866 |
| 44 | 0.2371134 | 0.7628866 |
| 52 | 0.8516484 | 0.1483516 |
| 53 | 0.8516484 | 0.1483516 |
| 55 | 0.8750000 | 0.1250000 |
predClass<-
factor(ifelse(pred$pos>0.5,"pos","neg"),
levels=levels(testDataBinary$diabetes)) # confusionMatrix needs a factor
confusionMatrix(data = predClass,
reference = testDataBinary$diabetes)
Confusion Matrix and Statistics
Reference
Prediction neg pos
neg 54 10
pos 9 24
Accuracy : 0.8041
95% CI : (0.7111, 0.8778)
No Information Rate : 0.6495
P-Value [Acc > NIR] : 0.0006531
Kappa : 0.5669
Mcnemar's Test P-Value : 1.0000000
Sensitivity : 0.8571
Specificity : 0.7059
Pos Pred Value : 0.8438
Neg Pred Value : 0.7273
Prevalence : 0.6495
Detection Rate : 0.5567
Detection Prevalence : 0.6598
Balanced Accuracy : 0.7815
'Positive' Class : neg